Abstract

The paper proposes a computationally 
feasible method for measuring context- 
sensitive semantic distance between 
words. The distance is computed by 
adaptive scaling of a semantic space. In 
the semantic space, each word in the vo- 
cabulary V is represented by a multi- 
dimensional vector which is obtained 
from an English dictionary through a 
principal component analysis. Given a 
word set C which specifies a context for 
measuring word distance, each dimen- 
sion of the semantic space is scaled up or 
down according to the distribution of C 
in the semantic space. In the space thus 
transformed, distance between words in 
V becomes dependent on the context C. 
An evaluation through a word predic- 
tion task shows that the proposed mea- 
surement successfully extracts the con- 
text of a text. 

Keywords: context-sensitivity, lexical 
similarity, semantic network, thesaurus, 
word association. 

1 Introduction 

Semantic distance (or similarity) between words 
is one of the basic measurements used in many 
fields of natural language processing, information 
retrieval, etc. Word distance provides bottom- 
up information for text understanding and gen- 
eration, since it indicates semantic relationships 
between words that form a coherent text struc- 
ture (Grosz & Sidner 86; Mann & Thompson 87); 
word distance also provides a basis for episode
association (Schank 90), since it works as
associative links between episodes.

A number of methods for measuring seman- 
tic distance between words have been proposed 
in the studies of psycholinguistics, computational 
linguistics, etc. One of the pioneering works 
in psycholinguistics is the "semantic
differential" (Osgood 52), which analyzes the
meaning of words by means of psychological
experiments on human subjects. Recent studies in
computational linguistics have proposed
computationally feasible methods for measuring
semantic word distance. For example, (Morris &
Hirst 91) used Roget's thesaurus as a knowledge
base for determining whether or not two words are
semantically related; (Brown et al. 92) classified
a vocabulary into semantic classes according to
the co-occurrence of words in large corpora;
(Kozima & Furugori 93) computed similarity between
words by means of spreading activation on a
semantic network of an English dictionary.

The measurements in the studies above are
so-called context-free or static ones, since they
measure word distance irrespective of context.
However, word distance changes from context to
context. For example, from the word car, we can
associate related words in the following two
directions.

• car → bus, taxi, railway, ...

• car → engine, tire, seat, ...

The former is in the context of "vehicle", and the
latter in the context of "components of a car".
Even in free-association tasks, we often imagine 
a certain context for retrieving related words. 

In this paper, we incorporate context-sensitivity
into semantic distance between words. A context
can be specified by a word set C consisting of
keywords of the context (for instance, C = {car,
bus} for the context "vehicle"). Context-sensitive
word association can then be exemplified as
follows.

• C = {car, bus}
  → taxi, railway, airplane, ...

• C = {car, engine}
  → tire, seat, headlight, ...

Generally, if we change the context C, we will
observe a different distance for the same word
pair. This paper therefore addresses the following
problem.

Under the context specified by a given
word set C, compute semantic distance
d(w, w'|C) between any two words w, w'
in our vocabulary V.

Our strategy for computing the context-sensitive
word distance is "adaptive scaling of a semantic
space". Section 2 introduces the semantic space,
where each word in the vocabulary V is represented
by a multi-dimensional semantic vector. The
semantic vectors, called Q-vectors, are obtained
through a principal component analysis on
P-vectors. P-vectors are generated by spreading
activation on the semantic network which is
systematically constructed from an English
dictionary. Section 3 describes adaptive scaling
of the semantic space. For a given word set C that
specifies a context, each dimension of the
semantic space is scaled up or down according to
the distribution of C in the semantic space. In
the semantic space thus transformed, distance
between Q-vectors becomes dependent on the given
context. Section 4 shows some examples of the
context-sensitive word distance computed by this
method. Section 5 evaluates the proposed
measurement through word prediction, i.e.
predicting succeeding words by using preceding
words in a text. Section 6 discusses some
theoretical aspects of the proposed method, and
Section 7 concludes the paper and puts this work
in perspective.

Figure 1: Mapping words onto Q-vectors

2 Vector-Representation of Word 
Meaning 

Each word in the vocabulary V is represented by 
a multi-dimensional Q-vector. In order to obtain 
Q-vectors, we first generate 2851-dimensional P- 
vectors by spreading activation on a semantic 
network of an English dictionary (Kozima & Fu- 
rugori 93). Next, through a principal component 
analysis on P-vectors, we map each P-vector onto 
a Q-vector with a reduced number of dimensions. 
(See Figure 1.) 

2.1 From an English Dictionary to
P-Vectors

Every word w in the vocabulary V is mapped onto a
P-vector P(w) by spreading activation on the
semantic network. The network is systematically
constructed from a subset of the English
dictionary LDOCE (Longman Dictionary of
Contemporary English). The network has 2851 nodes
corresponding to the words in LDV (Longman
Defining Vocabulary, 2851 words). The network has
295914 links between the nodes; each node has a
set of links corresponding to the words in its
definition in LDOCE. Note that every headword in
LDOCE is defined by using LDV only. The network
thus becomes a closed cross-reference network of
English words.

Each node of the network can hold activity, which
flows through the links. Hence, activating a node
of the network for a certain period of time causes
the activity to spread over the network and forms
a pattern of activity distribution on it. Figure 2
shows the pattern generated by activating the node
red; the graph plots the activity values of 10
dominant nodes at each step of time. We found
empirically that the activated pattern comes close
to equilibrium after approximately 10 steps,
although it never reaches the exact equilibrium.

The P-vector P(w) of a word w is the pattern of
activity distribution generated by activating the
node corresponding to w. P(w) is a
2851-dimensional vector consisting of the activity
values of the nodes at T = 10, taken as an
approximation of the equilibrium. P(w) indicates
how strongly each node of the network is
semantically related to w.
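
As a rough illustration of how such an activity pattern arises, the following Python sketch spreads activity over a tiny hand-made network for T = 10 steps. The update rule (each node retains part of its activity, receives part of the mean activity of its linked nodes, and the source node gets external input), the names `spread_activation`, `retain` and `spread`, and the toy links are all assumptions made for this sketch; the actual activation function is the one defined in (Kozima & Furugori 93) and is not reproduced here.

```python
def spread_activation(links, source, steps=10, retain=0.4, spread=0.4):
    """Return the activity pattern after `steps` rounds of spreading.

    links  -- dict mapping each node to the list of nodes it links to
    source -- node that receives external input at every step
    retain -- hypothetical fraction of its own activity a node keeps
    spread -- hypothetical weight of the activity arriving via links
    """
    activity = {node: 0.0 for node in links}
    for _ in range(steps):
        new_activity = {}
        for node, neighbours in links.items():
            # Mean activity of the linked nodes at the previous step.
            incoming = (sum(activity[n] for n in neighbours) / len(neighbours)
                        if neighbours else 0.0)
            external = 1.0 if node == source else 0.0
            # Clip to 1.0 so that activity stays bounded.
            new_activity[node] = min(
                1.0, retain * activity[node] + spread * incoming + external)
        activity = new_activity
    return activity


# Toy network: each word links to the words of its (made-up) definition.
toy_links = {
    "red": ["colour", "blood"],
    "colour": ["red", "light"],
    "blood": ["red", "body"],
    "light": ["colour"],
    "body": ["blood"],
}
p_red = spread_activation(toy_links, "red")
```

Here `p_red` plays the role of P(red): nodes directly linked to red end up with more activity than nodes two links away.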

In this paper, we define the vocabulary V as LDV
(2851 words) in order to keep our argument and
experiments simple. Although V is not a large
vocabulary, it covers 83.07% of the 1006815 words
of the LOB corpus with the help of a morphological
analysis. In addition, V can be extended to the
set of all headwords in LDOCE (more than 56000
words). Obviously, an LDV word is directly mapped
onto a P-vector by spreading activation on the
network; a non-LDV word can be mapped indirectly
onto a P-vector by activating the set of words in
its dictionary definition. (Recall that every
headword in LDOCE is defined by using LDV only.)

Figure 3: Hierarchical clustering of P-vectors

The P-vector P(w) represents the meaning of the
word w as its relationships with other words in
the vocabulary V. Geometric distance between two
P-vectors P(w) and P(w') indicates semantic
distance between the words w and w'. Figure 3
shows a part of the result of hierarchical
clustering on P-vectors, using Euclidean distance
between centers of clusters. The dendrogram
reflects intuitive semantic similarity between
words: for instance, rat/mouse, tiger/lion/cat,
etc. However, the similarity thus observed is
context-free and static, and the purpose of this
paper is to make it context-sensitive and dynamic.

2.2 From P-Vectors to Q-Vectors

Through a principal component analysis, we map
every P-vector onto a Q-vector, on which we will
define context-sensitive distance later. The
principal component analysis on P-vectors provides
a series of 2851 principal components. The most
significant m principal components work as new
orthogonal axes that span an m-dimensional vector
space. By the m principal components, every
P-vector (with 2851 dimensions) is mapped onto a
Q-vector (with m dimensions). The value of m,
which will be determined later, is much smaller
than 2851. This means not only compression of the
semantic information, but also elimination of the
noise in P-vectors.

First, we compute the principal components
X_1, X_2, ..., X_2851 (each of them is a
2851-dimensional vector) under the following
conditions.

• For any X_i, its norm |X_i| is 1.

• For any X_i, X_j (i ≠ j), their inner product
(X_i, X_j) is 0.

• The variance V_i of P-vectors projected onto
X_i is not smaller than any V_j (j > i).

In other words, X_1 is the first principal
component, which has the largest variance of
P-vectors, X_2 is the second principal component,
which has the second-largest variance of
P-vectors, and so on. Consequently, the set of
principal components X_1, X_2, ..., X_2851 provides
a new orthonormal coordinate system for P-vectors.

Figure 4: Cumulation of V_i (percentage)

Next, we pick up the first m principal components
X_1, X_2, ..., X_m. The principal components are
in descending order of their significance, because
the variance V_i indicates the amount of
information represented by X_i. The cumulative
variance in Figure 4 shows that even a few hundred
axes can represent more than half of the total
information of P-vectors. The amount of
information represented by Q-vectors increases
with m. However, for large m, each Q-vector would
be isolated because of overfitting: a large number
of parameters cannot be estimated from a small
amount of data.

We estimate the optimal number of dimensions of
Q-vectors at m = 281 by minimizing the proportion
of noise remaining in Q-vectors. The amount of the
noise is estimated by

\[ \sum_{w \in F} |Q(w)|, \]

where F (a subset of V) is a set of 210 function
words: determiners, articles, prepositions,
pronouns, and conjunctions. We estimated the
proportion of noise for all m = 1, ..., 2851 and
obtained the minimum for m = 281. Hereafter, we
will use a 281-dimensional semantic space.

Lastly, we map each P-vector P(w) onto a
281-dimensional Q-vector Q(w). The i-th component
of Q(w) is the projected value of P(w) on the
principal component X_i; the origin of X_i is set
to the average of the projected values on it. We
can ignore the direction of X_i, which determines
the sign of projected values, since it has no
effect on distance between Q-vectors.
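
The mapping from P-vectors to Q-vectors can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the principal components are found by power iteration with deflation, a simple stand-in chosen for this sketch (the text does not say how the analysis was computed), and the function names and the toy 3-dimensional "P-vectors" are hypothetical.

```python
import math
import random


def center(vectors):
    """Subtract the mean vector, so each axis has its origin at the average."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    return [[v[j] - mean[j] for j in range(dim)] for v in vectors]


def principal_components(centered, m, iterations=200, seed=0):
    """Return the first m principal components X_1..X_m as unit vectors."""
    rng = random.Random(seed)
    dim = len(centered[0])
    data = [row[:] for row in centered]
    components = []
    for _ in range(m):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iterations):
            # Multiply x by the (unnormalised) covariance matrix D^T D.
            proj = [sum(r[j] * x[j] for j in range(dim)) for r in data]
            x = [sum(proj[k] * data[k][j] for k in range(len(data)))
                 for j in range(dim)]
            norm = math.sqrt(sum(c * c for c in x))
            x = [c / norm for c in x]
        components.append(x)
        # Deflation: remove the variance already explained by x.
        for r in data:
            p = sum(r[j] * x[j] for j in range(dim))
            for j in range(dim):
                r[j] -= p * x[j]
    return components


def to_q_vectors(p_vectors, m):
    """Map each P-vector onto an m-dimensional Q-vector."""
    centered = center(p_vectors)
    axes = principal_components(centered, m)
    return [[sum(row[j] * x[j] for j in range(len(x))) for x in axes]
            for row in centered]


# Toy data: four 3-dimensional "P-vectors" reduced to m = 2 dimensions.
toy_p = [[2.0, 0.1, 0.0], [4.0, -0.1, 0.2], [6.0, 0.2, -0.1], [8.0, -0.2, 0.1]]
toy_q = to_q_vectors(toy_p, m=2)
```

For the real 2851-dimensional data a numerical library would be the practical choice; the sketch only shows the centering and projection steps described above.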

3 Adaptive Scaling of the 
Semantic Space 

Adaptive scaling of the semantic space of
Q-vectors provides context-sensitive and dynamic
distance between Q-vectors. Simple Euclidean
distance between Q-vectors is not so different
from that between P-vectors illustrated in Figure
3; both are context-free and static distances. The
adaptive scaling process transforms the semantic
space so as to make it adapt to a given context C.
In the semantic space thus transformed, simple
Euclidean distance between Q-vectors becomes
dependent on C. (See Figure 5.)

3.1 Semantic Subspaces 

A subspace of the semantic space of Q-vectors 
works as a simple device for semantic word clus- 
tering. In a semantic subspace with the dimen- 
sions appropriately selected, the Q-vectors of se- 
mantically related words are expected to form a 
cluster. The reasons for this are as follows. 

• Semantically related words have similar P- 
vectors, as illustrated in Figure 3. 

• The dimensions of Q-vectors are extracted 
from the correlations between P-vectors by 
means of the principal component analysis. 

As an example of word clustering in the se- 
mantic subspaces, let us consider the following 
15 words. 

1. after, 2. ago, 3. before,
4. bicycle, 5. bus, 6. car, 7. enjoy,
8. former, 9. glad, 10. good,
11. late, 12. pleasant, 13. railway,
14. satisfaction, 15. vehicle.

We scattered these words on the subspace X_2 ×
X_3, namely the plane spanned by the second and
third dimensions of Q-vectors. As shown in Figure
6, the words form three apparent clusters, namely
"goodness", "vehicle", and "past".

However, it is still difficult to select appro- 
priate dimensions for making a semantic cluster 
for given words. In the example above, we used 
only two dimensions; most semantic clusters need 
more dimensions to be well-separated. Moreover, 
each of the 2851 dimensions is just selected or 
discarded; this ignores their strengths of contri- 
bution to forming clusters. 

3.2 Adaptive Scaling 

Figure 6: Clusters in the semantic subspace

Adaptive scaling of the semantic space provides a
weight for each dimension in order to form a
desired semantic cluster; the weights are given by
scaling factors of the dimensions. The method
makes the semantic space adapt to a given context
C in the following way.

Each dimension of the semantic space 
is scaled up or down, so as to make the 
words in C form a cluster in the seman- 
tic space. 

In the semantic space thus transformed, the dis- 
tance between Q-vectors will change with C. For 
example, as illustrated in Figure 7, when C 
has oval-shaped (generally, hyper-elliptic) distri- 
bution in the pre-scaling space, each dimension 
is scaled up or down so that C has a round- 
shaped (generally, hyper-spherical) distribution 
in the post-scaling space. This coordinate trans- 
formation changes the mutual distance among Q- 
vectors. Before scaling, the Q-vector • is closer 
to C than the Q-vector o; after scaling, o comes 
near to C, and • goes away. 



The distance d(w, w'|C) between two words w and
w' under the context C = {w_1, ..., w_n} is
defined as the Euclidean distance in the scaled
space:

\[ d(w, w'|C) = \sqrt{\sum_{i=1}^{m} \bigl( f_i \, (q_i - q'_i) \bigr)^2}, \]

where Q(w) and Q(w') are the m-dimensional
Q-vectors of w and w', respectively:

\[ Q(w) = (q_1, \ldots, q_m), \qquad Q(w') = (q'_1, \ldots, q'_m). \]

The scaling factor f_i ∈ [0, 1] of the i-th
dimension is defined as follows.

\[ f_i = \begin{cases} 1 - r_i & (r_i \le 1) \\ 0 & (r_i > 1), \end{cases} \qquad r_i = \mathrm{SD}_i(C) / \mathrm{SD}_i(V), \]

where SD_i(C) is the standard deviation of the
i-th component values of w_1, ..., w_n, and
SD_i(V) is that of the words in the whole
vocabulary V.

The operation of the adaptive scaling described 
above is summarized as follows. 

• If C forms a compact cluster on the i-th
dimension (r_i ≈ 0), the dimension is scaled up
(f_i ≈ 1) so as to be sensitive to small
differences on the dimension.

• If C does not form an apparent cluster on the
i-th dimension (r_i ≫ 0), the dimension is scaled
down (f_i ≈ 0) so as to ignore small differences
on the dimension.

Now we can tune distance between Q-vectors to a
given word set C which specifies the context for
measuring the distance. In other words, we can
tune the semantic space of Q-vectors to the
context C. This tune-up procedure is not
computationally expensive: once we have computed
the set of Q-vectors and SD_1(V), ..., SD_m(V),
all we have to do is to compute the scaling
factors f_1, ..., f_m for a given word set C.
Computing distance between Q-vectors in the
transformed semantic space is no more expensive
than computing simple Euclidean distance between
Q-vectors.
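
The adaptive scaling of this section can be sketched in a few lines of Python. The sketch assumes the population standard deviation (the text does not say which estimator was used), a scaling rule of f_i = 1 - r_i clipped to 0 for r_i > 1, and a distance that is Euclidean after each dimension is multiplied by f_i; the function names and the toy 2-dimensional Q-vectors are hypothetical. Note that `sd_vocab` is computed once and reused for every context, which is why the tune-up step is cheap.

```python
import math


def std_dev(values):
    """Population standard deviation of a list of numbers (an assumption)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def scaling_factors(context_vectors, sd_vocab):
    """f_i = 1 - r_i if r_i <= 1 else 0, with r_i = SD_i(C) / SD_i(V)."""
    factors = []
    for i, sd_v in enumerate(sd_vocab):
        r = std_dev([q[i] for q in context_vectors]) / sd_v
        factors.append(1.0 - r if r <= 1.0 else 0.0)
    return factors


def distance(q, q_prime, factors):
    """Euclidean distance between two Q-vectors in the scaled space."""
    return math.sqrt(sum((f * (a - b)) ** 2
                         for f, a, b in zip(factors, q, q_prime)))


# Toy 2-dimensional Q-vectors; the values are made up for illustration.
toy_q = {
    "car": [1.0, 0.0],
    "bus": [1.2, 4.0],
    "taxi": [1.1, -3.0],
    "tree": [-3.0, 0.5],
}
# SD_i(V) is computed once for the whole vocabulary.
sd_vocab = [std_dev([q[i] for q in toy_q.values()]) for i in range(2)]

# Context C = {car, bus}: compact on dimension 0, spread out on dimension 1.
f = scaling_factors([toy_q["car"], toy_q["bus"]], sd_vocab)
d_taxi = distance(toy_q["car"], toy_q["taxi"], f)
d_tree = distance(toy_q["car"], toy_q["tree"], f)
```

In this toy setting the first dimension, on which car and bus agree, receives the larger scaling factor, so differences along it dominate the distance.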

4 Examples of Measuring the 
Word Distance 

Let us see a few examples of the context-sensitive 
distance between words computed by adaptive 
scaling of the semantic space with 281 dimen- 
sions. Here we deal with the following problem. 

Under the context specified by a given
word set C, compute the distance
d(w, C) between w and C, for every
word w in our vocabulary V.

The distance d(w, C) is defined as follows.

\[ d(w, C) = \frac{1}{|C|} \sum_{w' \in C} d(w, w'|C). \]

The distance d(w, C) is equal to the distance
between w and the center of C in the transformed
semantic space. In other words, d(w, C) indicates
the distance of w from the context C.

Table 1: C+ from C = {bus, car, railway}

Table 2: C+ from C = {bus, scenery, tour}

Now we can extract a word set C+(k) which consists
of the k closest words to the given context C.
This extraction is done by the following
procedure.

1. Sort all words in our vocabulary V in the
ascending order of d(w, C).

2. Let C+(k) be the word set which consists of
the first k words in the sorted list.

Note that C+(k) may not include all words in C,
even if k > |C|.
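
The two-step extraction procedure can be sketched as follows. The sketch takes the context-sensitive word distance d(w, w'|C) as a caller-supplied function `dist`, and averages it over the words of C to obtain d(w, C); the names `context_distance`, `closest_words`, `toy_position` and `toy_dist` are hypothetical, and the toy distance over 1-dimensional positions merely stands in for the real measure.

```python
def context_distance(word, context, dist):
    """d(w, C): mean of d(w, w'|C) over all w' in the context C."""
    return sum(dist(word, c) for c in context) / len(context)


def closest_words(vocabulary, context, dist, k):
    """C+(k): the k words of the vocabulary closest to the context C."""
    ranked = sorted(vocabulary,
                    key=lambda w: context_distance(w, context, dist))
    return ranked[:k]


# Hypothetical stand-in for d(w, w'|C): distance between toy 1-D positions.
toy_position = {"bus": 0.0, "car": 1.0, "taxi": 0.6, "tree": 5.0, "glad": 9.0}


def toy_dist(w, w2):
    return abs(toy_position[w] - toy_position[w2])


c_plus = closest_words(list(toy_position), ["bus", "car"], toy_dist, k=3)
```

Because every word of V is ranked, including the words of C themselves, nothing guarantees that the words of C come first in the sorted list.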

Here we will see some examples of extracting C+(k)
from a given context C. When the word set C =
{bus, car, railway} is given, our
context-sensitive word distance produces the
cluster C+(15) shown in Table 1. We can see from
the list that our word distance successfully
associates related words like motor and passenger
in the context of "vehicle". On the other hand,
from C = {bus, scenery, tour}, the cluster C+(15)
shown in Table 2 is obtained. We can see the
context "bus tour" in the list. Note that the list
is quite different from that of the former
example, though both contexts contain the word
bus.

When the word set C = {read, paper, magazine} is
given, the cluster C+(12) shown in Table 3 is
obtained. The extracted context is evidently
"education" or "study". On the other hand, when C
= {read, machine, memory}, the word set C+(12)
shown in Table 4 is obtained. It seems that most
of the words are related to "computer" or "mind".
These two clusters are quite different, even
though both contexts contain the word read.

Table 4: C+ from C = {read, machine, memory}

5 Evaluation through Word 
Prediction Task 

Figure 8: Word prediction (overview)

We evaluate the context-sensitive word distance
through predicting words in a text. When one is
reading a text (for instance, a novel), he or she
often predicts what is going to happen next by
using what has happened already. Here we will deal
with the following problem. (See Figure 8.)

For each sentence in a given text, pre- 
dict the words in the sentence by using 
the preceding n sentences. 

This task is not so difficult for human adults,
because a target sentence and the preceding
sentences tend to share their context; in other
words, a target sentence and the preceding
sentences are in the same context. This means that
predictability of the target sentence indicates
how successfully we extract information about the
context from the preceding sentences.

Consider a text as a sequence S_1, ..., S_N, where
S_i is the i-th sentence of the text. For a given
target sentence S_i, let C_i be the set of words
in the concatenation of the preceding n sentences:

C_i = {S_{i-n} ... S_{i-1}}.

Then, the prediction error e_i of S_i is computed
as follows.

1. Sort all the words in our vocabulary V' in
the ascending order of d(w, C_i).

2. Compute the average rank r_i of w_ij ∈ S_i in
the sorted list.

3. Let the prediction error e_i be the relative
average rank r_i/|V'|.

Note that we here use the vocabulary V', which
consists of 2641 words; we removed the 210
function words from the vocabulary V. Obviously,
the prediction is successful when e_i ≈ 0.
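
The three steps above can be sketched as a small Python function. It assumes that the vocabulary V' has already been sorted in ascending order of d(w, C_i), that ranks start at 1, and that target words missing from V' are simply skipped; these choices, like the name `prediction_error` and the toy ranking, are assumptions of this sketch.

```python
def prediction_error(target_words, ranked_vocabulary):
    """Relative average rank e_i = r_i / |V'| of the target words.

    ranked_vocabulary -- V' sorted in ascending order of d(w, C_i)
    target_words      -- the words w_ij of the target sentence S_i
    """
    rank = {w: pos for pos, w in enumerate(ranked_vocabulary, start=1)}
    # Words outside V' (e.g. function words) are skipped: an assumption.
    ranks = [rank[w] for w in target_words if w in rank]
    if not ranks:
        return None  # nothing to score for this sentence
    return (sum(ranks) / len(ranks)) / len(ranked_vocabulary)


# Toy ranking of an 8-word vocabulary, closest to the context first.
toy_ranking = ["bus", "car", "taxi", "road", "tree", "glad", "ago", "late"]
e_good = prediction_error(["car", "taxi"], toy_ranking)
e_bad = prediction_error(["ago", "late"], toy_ranking)
```

A sentence whose words sit near the top of the ranking yields a small error (`e_good`), while one whose words sit near the bottom yields an error close to 1 (`e_bad`).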

We used O. Henry's short story "Springtime a la
Carte" (Thornley 60) for the evaluation. The text
consists of 110 sentences (1620 words). We
computed the average value e of the prediction
errors e_i over the target sentences S_i (i = n +
1, ..., 110). For different numbers of preceding
sentences (n = 1, ..., 8), the average prediction
error e is summarized in Table 5.

If prediction is random, the expected value of
the average prediction error e is 0.5 (i.e.
chance). Our method predicted the succeeding words
better than random; the best result was observed
for n = 4. Without adaptive scaling of the
semantic space, simple Euclidean distance resulted
in e = 0.29050 for n = 4; our method is better
than this except for n = 1. When the succeeding
words are predicted by using prior probability of
word occurrence, we obtained e = 0.22907. The
prior probability is estimated by the word
frequency in West's 5-million-word corpus (West
53). Again, our result is better than this except
for n = 1.

Table 5: Average prediction error
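
The chance level of 0.5 quoted above can be checked with a short calculation. Assuming (this is our reading, not stated explicitly in the text) that ranks run from 1 to |V'| = 2641 and that a random predictor gives each target word a uniformly distributed rank, the expected rank is (|V'| + 1)/2, so that

```latex
\[
E[e] = \frac{1}{|V'|} \cdot \frac{|V'| + 1}{2}
     = \frac{2642}{2 \cdot 2641} \approx 0.5002 .
\]
```

Under the same assumption, an error such as e = 0.29050 corresponds to an average rank of roughly 767 out of 2641 words.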

6 Discussion 

6.1 Semantic Vectors 

A monolingual dictionary describes denotational 
meaning of words by using the words defined in it; 
a dictionary is a self-contained and self-sufficient 
system of words. Hence, a dictionary contains the 
knowledge for natural language processing (Wilks 
et al. 89). We represented meaning of words by 
the semantic vectors generated by the semantic 
network of the English dictionary LDOCE. While 
the semantic network ignores the syntactic struc- 
tures in dictionary definitions, each semantic vec- 
tor contains at least a part of the meaning of the 
headword (Kozima & Furugori 93). 

Co-occurrence statistics on corpora also provide
semantic information for natural language
processing. For example, mutual information
(Church & Hanks 90) and n-grams (Brown et al. 92)
can extract semantic relationships between words.
We can represent the meaning of words by the
co-occurrence vectors extracted from corpora. In
spite of the sparseness of corpora, each
co-occurrence vector contains at least a part of
the meaning of the word.

Semantic vectors from dictionaries and
co-occurrence vectors from corpora would have
different semantic information (Niwa & Nitta 94).
The former display paradigmatic relationships
between words, and the latter syntagmatic
relationships between words. We should incorporate
both of these complementary knowledge sources into
the vector-representation of word meaning.

6.2 Word Prediction and Text Structure 

In the word prediction task described in Section
5, we observed the best average prediction error
e for n = 4, where n denotes the number of
preceding sentences. One might expect e to
decrease with increasing n, since the more we read
of the preceding text, the better we can predict
the succeeding text. However, we observed the best
result for n = 4.

Most studies on text structure assume that a 
text can be segmented into units that form a text 
structure (Grosz & Sidner 86; Mann & Thompson 
87). Scenes in a text are contiguous and non- 
overlapping units, each of which describes certain 
objects (characters and properties) in a situation 
(time, place, and backgrounds). This means that
different scenes have different contexts.

The reason why n = 4 gives the best prediction
lies in the alternation of the scenes in the text.
When both a target sentence S_i and the preceding
sentences C_i are in one scene, prediction of S_i
from C_i would be successful. Otherwise, the
prediction would fail. In fact, we observed peaks
and dips in the graph of the prediction error
plotted against the sentence position i, as shown
in Figure 9. In addition, (Kozima 93; Kozima &
Furugori 94) reported that 21 scenes (5.24
sentences/scene on the average) were extracted
from the same text (O. Henry's short story)
through a psychological experiment on human
subjects.

6.3 Towards the Model of Memory and 
Attention 

We here try to put our method in perspective to- 
wards a model of human memory and attention. 
The model should give an explanation to the fol- 
lowing human abilities. 

• To focus on the important part of informa- 
tion, and to ignore the rest of it. 

• To change the direction of attention dynam- 
ically, and to follow the current state of the 
environment. 

These abilities are required in many fields of arti- 
ficial intelligence as well as contextual processing 
of natural language. 

Let us consider the memory system illustrated in
Figure 10, which is intended to recall the
concepts and episodes related to the current state
of the environment. The system has a short-term
memory (STM) that stores the concepts or episodes
recently recalled; the STM provides a context for
adaptive scaling. Hence, the system will recall
the words or episodes related to the preceding
experiences. With an STM of limited size (for
example, 7 ± 2 chunks), the system will change the
direction of attention dynamically.

Figure 10: Model of memory and attention

7 Conclusion

We proposed a context-sensitive and dynamic
measurement of word distance computed by adaptive
scaling of the semantic space. In the semantic
space, each word in the vocabulary is represented
by an m-dimensional Q-vector. Q-vectors are
obtained through a principal component analysis on
P-vectors. P-vectors are generated by spreading
activation on the semantic network which is
constructed systematically from the English
dictionary (LDOCE). The number of dimensions, m =
281, is determined by evaluating the noise
remaining in Q-vectors.

Given a word set C which specifies a context, each
dimension of the Q-vector space is scaled up or
down according to the distribution of C in the
semantic space. In the semantic space thus
transformed, word distance becomes dependent on
the context specified by C. An evaluation through
predicting words in a text shows that the proposed
measurement captures the context of the text well.

The context-sensitive and dynamic word distance
proposed here can be applied in many fields of
natural language processing, information
retrieval, etc. For example, the proposed
measurement can be used for word sense
disambiguation, in that the extracted context
provides a preference among ambiguous word senses.
Also, prediction of succeeding words will reduce
the computational cost in speech recognition
tasks. In future research, we regard the adaptive
scaling method as a model of human memory and
attention that enables us to follow a current
context, to restrict memory search, and to predict
what is going to happen next.